# Visual-Language Understanding
VL Rethinker 72B 4bit
Apache-2.0
VL-Rethinker-72B-4bit is a multimodal model based on Qwen2.5-VL-72B-Instruct that supports visual question answering. It has been converted to MLX format for efficient inference on Apple silicon.
Image-to-Text
Transformers English

mlx-community
26
0
3B Curr ReFT
Apache-2.0
A multimodal large language model fine-tuned from Qwen2.5-VL with the Curr-ReFT method to improve visual-language understanding and reasoning.
Image-to-Text
ZTE-AIM
37
3
Kosmos 2 Patch14 224
MIT
Kosmos-2 is a multimodal large language model capable of understanding and generating text descriptions related to images, and establishing associations between text and image regions.
Image-to-Text
Transformers

microsoft
171.99k
162
Mengzi Oscar Base Caption
Apache-2.0
A Chinese multimodal image captioning model based on the Mengzi-Oscar pretrained model and fine-tuned on the AIC-ICC Chinese image caption dataset.
Image-to-Text
Transformers Chinese

Langboat
23
2
Featured Recommended AI Models